Abstract

We consider a joint processing of $n$ independent sparse regression
problems. Each is based on a sample $(y_{i1}, x_{i1}), \dots, (y_{im}, x_{im})$ of $m$ i.i.d.
observations from $y_{i1} = x_{i1}^T\beta_i + \varepsilon_{i1}$, $y_{i1} \in \mathbb{R}$, $x_{i1} \in \mathbb{R}^p$, $i = 1, \dots, n$, and
$\varepsilon_{i1} \sim N(0, \sigma^2)$, say. $p$ is large enough so that the empirical risk
minimizer is not consistent. We consider three possible extensions of the
lasso estimator to deal with this problem, the lassoes, the group lasso
and the RING lasso, each utilizing a different assumption on how these
problems are related. For each estimator we give a Bayesian interpretation,
and we present both persistency analysis and non-asymptotic
error bounds based on restricted eigenvalue-type assumptions.

". . . and only a star or two set sparsedly in the vault of heaven; and you 
will find a sight as stimulating as the hoariest summit of the Alps." R. L. 
Stevenson 

1 Introduction 

We consider the model 

\[ Y_i = X_i^T\beta_i + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1) \]

or more explicitly

\[ y_{ij} = x_{ij}^T\beta_i + \varepsilon_{ij}, \quad i = 1, \dots, n, \; j = 1, \dots, m, \]

where $\beta_i \in \mathbb{R}^p$, $X_i \in \mathbb{R}^{m \times p}$ is either a deterministic fixed design matrix, or
a sample of $m$ independent $\mathbb{R}^p$ random vectors. Generally, we think of
$j$ as indexing replicates (of similar items within the group) and $i$ as indexing
groups (of replicates). Finally, $\varepsilon_{ij}$, $i = 1, \dots, n$, $j = 1, \dots, m$, are (at least)
uncorrelated with the $x$'s, but are typically assumed to be i.i.d. sub-Gaussian
random variables, independent of the regressors $x_{ij}$. We can consider this as
$n$ partially related regression models, with $m$ i.i.d. observations on each
model. For simplicity, we assume that all variables have expectation 0. The
fact that the number of observations does not depend on $i$ is arbitrary
and is assumed only for the sake of notational simplicity.

The standard FDA (functional data analysis) is of this form, when the
functions are approximated by their projections on some basis. Here we
have $n$ i.i.d. random functions, and each group can be considered as $m$
noisy observations, each one on the value of these functions at a given
value of the argument. Thus,

\[ y_{ij} = g_i(z_{ij}) + \varepsilon_{ij}, \qquad (2) \]

where $z_{ij} \in [0,1]$. The model fits the regression setup of (1) if $g_i(z) =
\sum_{\ell=1}^p \beta_{i\ell} h_\ell(z)$, where $h_1, \dots, h_p$ are in $L_2(0,1)$, and $x_{ij\ell} = h_\ell(z_{ij})$.

This approach is in the spirit of the empirical Bayes approach (or compound
decision theory; note, however, that the term "empirical Bayes" has a
few other meanings in the literature), cf. [11, 12, 8]. The empirical Bayes
approach to sparsity was considered before, e.g., [15, 3, 7, 6]. However, in these
discussions the compound decision problem was within a single vector, while we
consider the compound decision to be between the vectors, where the vectors
are the basic units. The beauty of the concept of compound decision
is that we do not have to assume that in reality the units are related. They
are considered as related only because our loss function is additive.

One of the standard tools for finding sparse solutions in a large p small 
m situation is the lasso (Tibshirani [13]), and the methods we consider are 
its extensions. 

We will make use of the following notation. Introduce the $\ell_{p,q}$ norm of a set
of vectors $z_1, \dots, z_n$, not necessarily of the same length, with components $z_{ij}$, $i = 1, \dots, n$,
$j = 1, \dots, J_i$:

Definition 1.1 \\z 

These norms will serve as a penalty on the size of the matrix $B = (\beta_1, \dots, \beta_n)$.
Different norms imply different estimators, each appropriate under different
assumptions.

Within the framework of the compound decision theory, we can have
different scenarios, and we consider three of them. In Section 2 we investigate
the situation when there is no direct relationship between the groups,
and the only way the data are combined together is via the selection of the
common penalty. In this case the sparsity patterns of the solutions for the
groups are unrelated. We argue that the alternative formulation of the lasso
procedure in terms of the $\ell_{2,1}$ (or, more generally, $\ell_{\alpha,1}$) norm, which we refer to
as "lassoes", can be more natural than the simple lasso, and this is argued
from different points of view.

The motivation is as follows. The lasso method can be described in two
related ways. Consider the one group version, $y_j = x_j^T\beta + \varepsilon_j$. The lasso
estimator can be defined by

\[ \text{Minimize } \sum_{j=1}^m (y_j - x_j^T\beta)^2 \quad \text{s.t. } \|\beta\|_1 < A. \]

An equivalent definition, using a Lagrange multiplier, is given by

\[ \text{Minimize } \sum_{j=1}^m (y_j - x_j^T\beta)^2 + \lambda\|\beta\|_1^\alpha, \]

where $\alpha$ can be any arbitrarily chosen positive number. In the literature
one can find almost only $\alpha = 1$. One exception is Greenshtein and Ritov [5],
where $\alpha = 2$ was found more natural, although it was just a matter of aesthetics.
We would argue that $\alpha > 2$ may be more intuitive. Our first algorithm
generalizes this representation of the lasso directly to deal with the compound
model (1).

In the framework of the compound decision problem it is possible to
consider the $n$ groups as repeated similar models for $p$ variables, and to
choose the variables that are useful for all models. We consider this in
Section 3. The relevant variation of the lasso procedure in this case is the group
lasso introduced by Yuan and Lin [14]:

\[ \text{Minimize } \sum_{i=1}^n \sum_{j=1}^m (y_{ij} - x_{ij}^T\beta_i)^2 + \lambda\|B\|_{2,1}. \qquad (3) \]

The authors also showed that in this case the sparsity pattern of variables is
the same (with probability 1). Non-asymptotic inequalities under a restricted
eigenvalue type condition for the group lasso are given by Lounici et al. [10].

Now, the standard notion of sparsity, as captured by the $L_0$ norm, or by
the standard lasso and group lasso, is basis dependent. Consider the model
of (2). If, for example, $g(z) = 1(a < z \le b)$, then this example is sparse
when $h_\ell(z) = 1(z > \ell/p)$. It is not sparse if $h_\ell(z) = (z - \ell/p)_+$. On the
other hand, a function $g$ which has a piecewise constant slope is sparse in
the latter basis, but not in the former, even though each function can be
represented equally well in both bases.

Suppose that there is a sparse representation in some unknown basis,
which is assumed common to the $n$ groups. The question arises: can we recover
the basis corresponding to the sparsest representation? We will argue that
the trace norm penalty (the Schatten norm with $p = 1$) aims at
finding the rotation that gives the best sparse representation of all vectors
instantaneously (Section 4). We refer to this method as the rotation-invariant
lasso, or shortly as the RING lasso. This is not surprising as under some
conditions, this penalty also solves the minimum rank problem (see Candes
and Recht [4] for the noiseless case, and Bach [1] for some asymptotic results).
By analogy with the lassoes argument, a higher power of the trace norm as
a penalty may be more intuitive to a Bayesian.

For both procedures considered here, the lassoes and the RING lasso, we
present bounds on their persistency as well as non-asymptotic inequalities
under restricted eigenvalue type conditions. All the proofs are given in
the Appendix.

2 The lassoes procedure 

The minimal structural relationship we may assume is that the $\beta$'s are not
related, except that we believe that there is a bound on the average sparsity
of the $\beta$'s. One possible approach would be to consider the problem as a
standard sparse regression problem with $nm$ observations, a single vector of
coefficients $\beta = (\beta_1^T, \dots, \beta_n^T)^T$, and a block diagonal design matrix $X$. This
solution imposes very little on the similarity among $\beta_1, \dots, \beta_n$. The lassoes
procedure discussed in this section assumes that these vectors are similar, at
least in their level of sparsity.

2.1 Prediction error minimization 

In this paper we adopt an oracle point of view. Our estimator is the empirical
minimizer of the risk penalized by the complexity of the solution (i.e., by its
$\ell_1$ norm). We compare this estimator to the solution of an "oracle" who does
the same, but optimizes over the true population distribution, which is unknown
to simple human beings.

We assume that each vector $\beta_i$, $i = 1, \dots, n$, solves a different problem,
and these problems are related only through the joint loss function,
which is the sum of the individual losses. To be clearer, we assume that
for each $i = 1, \dots, n$, $Z_{ij} = (y_{ij}, x_{ij}^T)^T$, $j = 1, \dots, m$, are i.i.d. sub-Gaussian
random variables, drawn from a distribution $Q_i$. Let $Z_i = (y_i, x_i^T)^T$ be an
independent sample from $Q_i$. For any vector $a$, let $\tilde a = (-1, a^T)^T$, and let
$\Sigma_i$ be the covariance matrix of $Z_i$ and $\mathfrak{S} = (\Sigma_1, \dots, \Sigma_n)$. The goal is to find
the matrix $\hat B = (\hat\beta_1, \dots, \hat\beta_n)$ that minimizes the mean prediction error:

\[ L(B, \mathfrak{S}) = \sum_{i=1}^n E_{Q_i}(y_i - x_i^T\beta_i)^2 = \sum_{i=1}^n \tilde\beta_i^T \Sigma_i \tilde\beta_i. \qquad (4) \]

For $p$ small, the natural approach is empirical risk minimization, that is,
replacing $\Sigma_i$ in (4) by $S_i$, the empirical covariance matrix of $Z_i$. However,
generally speaking, if $p$ is large, empirical risk minimization results in overfitting
the data. Greenshtein and Ritov [5] suggested (for the standard $n = 1$)
minimization over a restricted set of possible $\beta$'s, in particular, over either
$L_1$ or $L_0$ balls. In fact, their argument is based on the following simple
observations:

\[ |\tilde\beta^T(\Sigma_i - S_i)\tilde\beta| \le \|\Sigma_i - S_i\|_\infty \|\tilde\beta\|_1^2 \]
and
\[ \|\Sigma_i - S_i\|_\infty = O_p(m^{-1/2}\log p) \qquad (5) \]

(see Lemma A.1 in the Appendix for the formal argument).
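The first inequality in (5) is elementary: for any square matrix $M$ and vector $a$, $|a^T M a| \le \max_{r,c}|M_{rc}| \, \|a\|_1^2$. The following is a minimal numerical sketch of this fact on random inputs. The function names and the choice of random test matrices are illustrative assumptions and are not part of the paper.

```python
import random

def sup_norm(M):
    """Largest absolute entry of a square matrix given as a list of rows."""
    return max(abs(v) for row in M for v in row)

def quad_form(a, M):
    """Quadratic form a^T M a."""
    d = len(a)
    return sum(a[r] * M[r][c] * a[c] for r in range(d) for c in range(d))

def worst_ratio(dim, trials, seed=0):
    """Largest observed |a^T M a| / (||M||_inf * ||a||_1^2) over random draws."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        M = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        lhs = abs(quad_form(a, M))
        rhs = sup_norm(M) * sum(abs(v) for v in a) ** 2
        worst = max(worst, lhs / rhs)
    return worst
```

The ratio returned by `worst_ratio` never exceeds 1, which is what makes the $\ell_1$ norm of $\tilde\beta$ the natural measure of complexity when $\Sigma_i$ is replaced by $S_i$.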

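This leads to the natural extension of the single vector lasso to the compound
decision problem setup, where we penalize by the sum of the squared
$L_1$ norms of the vectors $\beta_1, \dots, \beta_n$, and obtain the estimator defined by:

\[
(\hat\beta_1, \dots, \hat\beta_n)
= \arg\min_{\beta_1, \dots, \beta_n} \Bigl\{ m \sum_{i=1}^n \tilde\beta_i^T S_i \tilde\beta_i + \lambda_n \sum_{i=1}^n \|\beta_i\|_1^2 \Bigr\}
= \arg\min_{\beta_1, \dots, \beta_n} \Bigl\{ \sum_{i=1}^n \sum_{j=1}^m (y_{ij} - x_{ij}^T\beta_i)^2 + \lambda_n \sum_{i=1}^n \|\beta_i\|_1^2 \Bigr\}. \qquad (6)
\]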
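To make (6) concrete, consider a single group under the simplifying assumption (ours, not the paper's) of an orthonormal design, $X^TX = mI$. The objective then reduces to $m\sum_\ell (z_\ell - b_\ell)^2 + \lambda(\sum_\ell |b_\ell|)^2$ up to a constant, where $z$ is the least squares solution. The stationarity conditions give a soft-thresholding rule whose threshold $\tau = \lambda S/m$ depends on the solution's own $\ell_1$ norm $S$, so that $S = A_k/(1 + k\lambda/m)$ when the $k$ largest $|z_\ell|$, with sum $A_k$, are active. The sketch below, with hypothetical function names, solves this fixed point exactly.

```python
import math

def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning 0 when |z| <= t."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def lassoes_orthonormal(z, lam, m):
    """Minimise m*sum((z_l - b_l)^2) + lam*(sum |b_l|)^2 over b.

    Assumes an orthonormal design, so z is the least squares solution.
    """
    a = sorted((abs(v) for v in z), reverse=True)
    tau, csum = 0.0, 0.0
    for k, ak in enumerate(a, start=1):
        csum += ak
        s = csum / (1.0 + k * lam / m)   # l1 norm if the k largest are active
        t = lam * s / m                  # implied threshold
        nxt = a[k] if k < len(a) else 0.0
        if ak > t >= nxt:                # active set is self-consistent
            tau = t
            break
    return [soft_threshold(v, tau) for v in z]

def objective(b, z, lam, m):
    """Value of the reduced objective at b."""
    fit = m * sum((zi - bi) ** 2 for zi, bi in zip(z, b))
    return fit + lam * sum(abs(bi) for bi in b) ** 2
```

For example, with `z = [3.0, 1.0, 0.2]`, `lam = 1.0` and `m = 1.0`, only the largest coordinate survives. Unlike the ordinary lasso, the threshold grows with the size of the fitted vector.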



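The prediction error of the lassoes estimator can be bounded in the
following way. In the statement of the theorem, $c_n$ is the minimal achievable
risk, while $C_n$ is the risk achieved by a particular sparse solution.

Theorem 2.1 Let $\beta_{i0}$, $i = 1, \dots, n$, be $n$ arbitrary vectors and let $C_n =
n^{-1}\sum_{i=1}^n \tilde\beta_{i0}^T \Sigma_i \tilde\beta_{i0}$. Let $c_n = n^{-1}\sum_{i=1}^n \min_\beta \tilde\beta^T \Sigma_i \tilde\beta$. Then

\[
\sum_{i=1}^n \hat{\tilde\beta}_i^T \Sigma_i \hat{\tilde\beta}_i
\le \sum_{i=1}^n \tilde\beta_{i0}^T \Sigma_i \tilde\beta_{i0}
+ \Bigl(\frac{\lambda_n}{m} + \delta_n\Bigr) \sum_{i=1}^n \|\beta_{i0}\|_1^2
- \Bigl(\frac{\lambda_n}{m} - \delta_n\Bigr) \sum_{i=1}^n \|\hat\beta_i\|_1^2,
\]

where $\delta_n = \max_i \|S_i - \Sigma_i\|_\infty$. If also $\lambda_n/m \to 0$ and $\lambda_n/(m^{1/2}\log(np)) \to \infty$,
then

\[
\sum_{i=1}^n \|\hat\beta_i\|_1^2 = O_p\Bigl(mn\frac{C_n - c_n}{\lambda_n}\Bigr)
+ \Bigl(1 + O\Bigl(\frac{m^{1/2}}{\lambda_n}\log(np)\Bigr)\Bigr) \sum_{i=1}^n \|\beta_{i0}\|_1^2 \qquad (7)
\]

and

\[
\sum_{i=1}^n \hat{\tilde\beta}_i^T \Sigma_i \hat{\tilde\beta}_i
\le \sum_{i=1}^n \tilde\beta_{i0}^T \Sigma_i \tilde\beta_{i0}
+ (1 + o_p(1)) \frac{\lambda_n}{m} \sum_{i=1}^n \|\beta_{i0}\|_1^2.
\]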

The result is meaningful, although not as strong as may be wished, as
long as $C_n - c_n \to 0$, while $n^{-1}\sum_{i=1}^n \|\beta_{i0}\|_1^2 = o_p(m^{1/2})$. That is, when
there are relatively sparse approximations to the best regression functions.
Here sparse means only that the $L_1$ norms of the vectors are strictly smaller, on
the average, than $\sqrt{m}$. Of course, if the minimizer of $\tilde\beta^T \Sigma_i \tilde\beta$ itself is sparse,
then by (7) $\hat\beta_1, \dots, \hat\beta_n$ are as sparse as the true minimizers.

Also note that the prescription that the theorem gives for selecting $\lambda_n$
is sharp: choose $\lambda_n$ as close as possible to $m\delta_n$, or slightly larger than $\sqrt{m}$.



2.2 A Bayesian perspective 

The estimators $\hat\beta_1, \dots, \hat\beta_n$ look as if they are the mode of the a-posteriori
distribution of the $\beta_i$'s when $y_{ij} \mid \beta_i \sim N(x_{ij}^T\beta_i, \sigma^2)$, the $\beta_1, \dots, \beta_n$ are a priori
independent, and $\beta_i$ has a prior density proportional to $\exp(-\lambda_n\|\beta_i\|_1^2/\sigma^2)$.
This distribution can be constructed as follows. Suppose $T_i \sim N(0, \lambda_n^{-1}\sigma^2)$.
Given $T_i$, let $u_{i1}, \dots, u_{ip}$ be distributed uniformly on the simplex $\{u_{i\ell} \ge
0, \sum_{\ell=1}^p u_{i\ell} = |T_i|\}$. Let $s_{i1}, \dots, s_{ip}$ be i.i.d. Rademacher random variables
(taking values $\pm 1$ with probabilities 0.5), independent of $T_i, u_{i1}, \dots, u_{ip}$.
Finally let $\beta_{i\ell} = u_{i\ell}s_{i\ell}$, $\ell = 1, \dots, p$.
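The construction above can be sketched as a sampler. The code below is a minimal illustration using only the Python standard library. The function name is hypothetical, and the simplex step relies on the standard fact that normalized i.i.d. exponential variables are uniform on the simplex; the simplex total is taken to be $|T_i|$.

```python
import math
import random

def sample_lassoes_prior(p, lam, sigma, rng):
    """Draw one vector beta_i following the construction in the text.

    Returns (t, beta), where t is the Gaussian draw T_i and beta has
    l1 norm |t|.
    """
    # T_i ~ N(0, sigma^2 / lam)
    t = rng.gauss(0.0, sigma / math.sqrt(lam))
    # Uniform point on the simplex {u >= 0, sum u = |t|}
    e = [rng.expovariate(1.0) for _ in range(p)]
    total = sum(e)
    u = [abs(t) * v / total for v in e]
    # Independent Rademacher signs
    signs = [rng.choice((-1.0, 1.0)) for _ in range(p)]
    return t, [ui * si for ui, si in zip(u, signs)]
```

By construction the $\ell_1$ norm of the sampled vector equals $|T_i|$, so the Gaussian tail of $T_i$ is what produces the squared $\ell_1$ norm in the exponent.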

However, this Bayesian point of view is not consistent with the conditions
of Theorem 2.1. An appropriate prior should express the beliefs on the
unknown parameter, which are by definition conceptually independent of
the amount of data to be collected. However, the permitted range of $\lambda_n$ does
not depend on the assumed range of $\|\beta_i\|$, but quite artificially should be of
order between $m^{1/2}$ and $m$. That is, the penalty should be increased with
the number of observations on $\beta_i$, although at a slower rate than $m$. In fact,
even if we relax what we mean by "prior", the value of $\lambda_n$ goes in the 'wrong'
direction. As $m \to \infty$, one may wish to use weaker a-priori assumptions,
and permit $T$ to have an a-priori second moment going to infinity, not to 0, as
entailed by $\lambda_n \to \infty$.

We would like to consider a more general penalty of the form $\sum_{i=1}^n \|\beta_i\|_1^\alpha$.
A power $\alpha \ne 1$ of the $\ell_1$ norm of $\beta$ as a penalty introduces a priori dependence
between the variables, which is not the case for the regular lasso penalty with
$\alpha = 1$, where all $\beta_{ij}$ are a priori independent. As $\alpha$ increases, the sparsity of
the different vectors tends to be the same. Note that given the value of $\lambda_n$,
the $n$ problems are treated independently. The compound decision problem
is reduced to picking a common level of penalty. When this choice is data
based, the different vectors become dependent. This is the main benefit of
this approach: the selection of the regularization is based on all the $mn$
observations.

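For a proper Bayesian perspective, we need to consider a prior with much
smaller tails than the normal. Suppose for simplicity that $c_n = C_n$ (that is,
the "true" regressors are sparse), and $\max_i \|\beta_{i0}\|_1 < \infty$.

Theorem 2.2 Let $\beta_{i0}$ be the minimizer of $\tilde\beta^T \Sigma_i \tilde\beta$. Suppose $\max_i \|\beta_{i0}\|_1 <
\infty$. Consider the estimators:

\[
(\hat\beta_1, \dots, \hat\beta_n) = \arg\min_{\beta_1, \dots, \beta_n} \Bigl\{ m \sum_{i=1}^n \tilde\beta_i^T S_i \tilde\beta_i + \lambda_n \sum_{i=1}^n \|\beta_i\|_1^\alpha \Bigr\}
\]

for some $\alpha > 2$. Assume that $\lambda_n = O(m\delta_n) = O(m^{1/2}\log p)$. Then

\[
n^{-1}\sum_{i=1}^n \|\hat\beta_i\|_1^2 = O\bigl((m\delta_n/\lambda_n)^{2/(\alpha-2)}\bigr),
\]

and

\[
\sum_{i=1}^n \hat{\tilde\beta}_i^T \Sigma_i \hat{\tilde\beta}_i
\le \sum_{i=1}^n \tilde\beta_{i0}^T \Sigma_i \tilde\beta_{i0}
+ O_p\bigl(n (m/\lambda_n)^{2/(\alpha-2)} \delta_n^{\alpha/(\alpha-2)}\bigr).
\]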

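Remark 2.1 If the assumption $\lambda_n = O(m\delta_n)$ does not hold, i.e. if $m\delta_n/\lambda_n =
o(1)$, then the error term dominates the penalty and we get similar rates as
in Theorem 2.1, i.e.

\[ n^{-1}\sum_{i=1}^n \|\hat\beta_i\|_1^2 = O_p(1), \]

and

\[
\sum_{i=1}^n \hat{\tilde\beta}_i^T \Sigma_i \hat{\tilde\beta}_i
\le \sum_{i=1}^n \tilde\beta_{i0}^T \Sigma_i \tilde\beta_{i0} + O_p(n\lambda_n/m).
\]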
Note that we can in fact take $\lambda_n \to 0$, to accommodate an increasing
value of the $\beta_i$'s.

The theorem suggests a simple way to select $\lambda_n$ based on the data. Note
that $n^{-1}\sum_{i=1}^n \|\hat\beta_i\|_1^2$ is a decreasing function of $\lambda$. Hence, we can start with
a very large value of $\lambda$ and decrease it until $n^{-1}\sum_{i=1}^n \|\hat\beta_i\|_1^2 \approx \lambda^{-2/\alpha}$.
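The selection rule just described can be sketched as a simple search. In the sketch below, `mean_sq_l1` is a hypothetical callable, supplied by the user, that fits the estimator at a given $\lambda$ and returns $n^{-1}\sum_i \|\hat\beta_i\|_1^2$. The geometric step `shrink` and the floor `lam_min` are our own illustrative choices and are not prescribed by the paper.

```python
def select_lambda(mean_sq_l1, alpha, lam_start, shrink=0.9, lam_min=1e-8):
    """Decrease lambda geometrically until the fitted mean squared l1 norm
    reaches the target lambda**(-2/alpha).

    mean_sq_l1(lam) must be non-increasing in lam.
    """
    lam = lam_start
    while lam > lam_min and mean_sq_l1(lam) < lam ** (-2.0 / alpha):
        lam *= shrink
    return lam

# Toy stand-in for a fitted statistic: zero for lam >= 10, growing as lam drops.
def toy_statistic(lam):
    return max(0.0, 10.0 - lam)

chosen = select_lambda(toy_statistic, alpha=2.0, lam_start=100.0)
```

Because both the statistic and the target vary monotonically along the search path, the loop stops at the first grid value where the fitted norms are no longer below the target.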

2.3 Restricted eigenvalues conditions and non-asymptotic in- 
equalities 

Before stating the conditions and the inequalities for the lassoes procedure, 
we introduce some notation and definitions. 

For a vector $\beta$, let $\mathcal{M}(\beta)$ be the cardinality of its support: $\mathcal{M}(\beta) =
\sum_i 1(\beta_i \ne 0)$. Given a matrix $\Delta \in \mathbb{R}^{n \times p}$ and given a set $J = \{J_i\}$, $J_i \subset
\{1, \dots, p\}$, we denote $\Delta_J = \{\Delta_{ij}, i = 1, \dots, n, j \in J_i\}$. By the complement
$J^c$ of $J$ we denote the set $\{J_1^c, \dots, J_n^c\}$, i.e. the set of complements of the $J_i$'s.
Below, $X$ is the $np \times nm$ block diagonal design matrix, $X = \mathrm{diag}(X_1, X_2, \dots, X_n)$,
and with some abuse of notation, a matrix $\Delta = (\Delta_1, \dots, \Delta_n)$ may be
considered as the vector $(\Delta_1^T, \dots, \Delta_n^T)^T$. Finally, recall the notation $B =
(\beta_1, \dots, \beta_n)$.

The restricted eigenvalue assumption of Bickel et al. [2] (and Lounici
et al. [10]) can be generalized to incorporate unequal subsets $J_i$. In the
assumption below, the restriction is given in terms of the $\ell_{q,1}$ norm, $q \ge 1$.

Assumption RE$_q(s, c_0, \kappa)$.

\[
\kappa = \min\Bigl\{ \frac{\|X^T\Delta\|_2}{\sqrt{m}\,\|\Delta_J\|_2} : \max_i |J_i| \le s, \; \Delta \in \mathbb{R}^{n \times p} \setminus \{0\}, \; \|\Delta_{J^c}\|_{q,1} \le c_0\|\Delta_J\|_{q,1} \Bigr\} > 0.
\]

We apply it with $q = 1$, and in Lounici et al. [10] it was used for $q = 2$. We
call it a restricted eigenvalue assumption to be consistent with the literature.
In fact, as stated it is a definition of $\kappa$ as the maximal value that satisfies
the condition, and the only real assumption is that $\kappa$ is positive. However,
the larger $\kappa$ is, the more useful the "assumption" is. A discussion of the
normalisation by $\sqrt{m}$ can be found in Lounici et al. [10].

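For the penalty $\lambda\sum_{i=1}^n \|\beta_i\|_1^\alpha$, we have the following inequalities.

Theorem 2.3 Assume $y_{ij} \sim N(x_{ij}^T\beta_i, \sigma^2)$, and let $\hat\beta$ be a minimizer of (6),
with

\[
\lambda \ge \frac{4A\sigma\sqrt{m\log(np)}}{\alpha\max(B^{\alpha-1}, \hat B^{\alpha-1})},
\]

where $\alpha \ge 1$ and $A > \sqrt{2}$, $B \ge \max_i \|\beta_i\|_1$ and $\hat B \ge \max_i \|\hat\beta_i\|_1$, $\max(B, \hat B) > 0$
($B$ may depend on $n, m, p$, and so can $\hat B$). Suppose that the generalized
assumption RE$_1(s, 3, \kappa)$ defined above holds, $\sum_{j=1}^m x_{ij\ell}^2 = m$ for all $i, \ell$, and
$\mathcal{M}(\beta_i) \le s$ for all $i$.

Then, with probability at least $1 - (np)^{1-A^2/2}$,

(a) The root mean squared prediction error is bounded by:

\[
\frac{1}{\sqrt{nm}}\|X^T(\hat B - B)\|_2 \le \frac{\sqrt{s}}{\kappa\sqrt{m}}\Bigl[\frac{3\alpha\lambda}{2\sqrt{m}}\max(B^{\alpha-1}, \hat B^{\alpha-1}) + 2A\sigma\sqrt{\log(np)}\Bigr].
\]

(b) The mean estimation absolute error is bounded by:

\[
\frac{1}{n}\|B - \hat B\|_1 \le \frac{4s}{m\kappa^2}\Bigl[\frac{3\alpha\lambda}{2}\max(B^{\alpha-1}, \hat B^{\alpha-1}) + 2A\sigma\sqrt{m\log(np)}\Bigr].
\]

(c) If $\bigl|\|\hat\beta_i\|_1^{\alpha-1} - b^{\alpha-1}/2\bigr| \ge \delta$ for some $\delta > 0$,

\[
\mathcal{M}(\hat\beta_i) \le \|X_i^T(\beta_i - \hat\beta_i)\|_2^2 \, \frac{m\phi_{i,\max}}{\bigl(\lambda\alpha\|\hat\beta_i\|_1^{\alpha-1}/2 - A\sigma\sqrt{m\log(np)}\bigr)^2},
\]

where $\phi_{i,\max}$ is the maximal eigenvalue of $X_i^TX_i/m$.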



Note that for $\alpha = 1$, if we take $\lambda = 2A\sigma\sqrt{m\log(np)}$, the bounds are of
the same order as for the lasso with $np$-dimensional $\beta$ (up to a constant of
2, cf. Theorem 7.2 in Bickel et al. [2]). For $\alpha > 1$, we have dependence of
the bounds on the $\ell_1$ norm of $\beta$ and $\hat\beta$.

We can use the bounds on the norm of $\hat\beta$ given in Theorem 2.2 to obtain the
following results.

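Theorem 2.4 Assume $y_{ij} \sim N(x_{ij}^T\beta_i, \sigma^2)$, with $\max_i \|\beta_i\|_1 \le b$, where $b > 0$
can depend on $n, m, p$. Take some $\eta \in (0, 1)$. Let $\hat\beta$ be a minimizer of (6),
with

\[ \lambda = \frac{4A\sigma}{\alpha b^{\alpha-1}}\sqrt{m\log(np)}, \]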

A > y/2, such that b > C 2 ^ -1 )) f r some constant c > 0. Also, assume 
that C n — c n = Q(m5 n ), as defined in Theorem 2.1. 

Suppose that the generalized assumption RE$_1(s, 3, \kappa)$ defined above holds,
$\sum_{j=1}^m x_{ij\ell}^2 = m$ for all $i, \ell$, and $\mathcal{M}(\beta_i) \le s$ for all $i$.

Then, for some constant $C > 0$, with probability at least $1 - \bigl(\eta + (np)^{1-A^2/2}\bigr)$,
(a) The prediction error can be bounded by:



\X l (B-B)\\i ^ 



2 4A a snlog(np) 



1 + 3C 



(o-l)/(a-2)- 



(b) The estimation absolute error is bounded by: 



\B-B\U < 



2Aasny/\og(np) 



K 2 \/m 



1 + 3C 



b ^(a-l)/(a-2) 



(c) Average sparsity of 
1 



n 



i=i 



c 2 5 2 



, v 1+1/(0-2)' 

1 + 3C1 ^) 



where $\phi_{\max}$ is the largest eigenvalue of $X^TX/m$.

This theorem also tells us how large the $\ell_1$ norm of $\beta$ can be to ensure good
bounds on the prediction and estimation errors.

Note that under the Gaussian model and fixed design matrix, assumption 
C n — c n = 0(m8 n ) is equivalent to \\B\fy ^ Cm5 n . 



3 Group LASSO: Bayesian perspective 

Group LASSO is defined (see Yuan and Lin [14]) by

\[
(\hat\beta_1, \dots, \hat\beta_n) = \arg\min \Bigl[ \sum_{i=1}^n \sum_{j=1}^m (y_{ij} - x_{ij}^T\beta_i)^2 + \lambda \sum_{\ell=1}^p \Bigl\{ \sum_{i=1}^n \beta_{i\ell}^2 \Bigr\}^{1/2} \Bigr]. \qquad (8)
\]

Note that $(\hat\beta_1, \dots, \hat\beta_n)$ are defined as the minimum point of a strictly convex
function, and hence they can be found by equating the gradient of this
function to 0.

Recall the notation $B = (\beta_1, \dots, \beta_n) = (b_1^T, \dots, b_p^T)^T$. Note that (8) is
equivalent to the mode of the a-posteriori distribution when, given $B$, the $y_{ij}$,
$i = 1, \dots, n$, $j = 1, \dots, m$, are all independent, $y_{ij} \mid B \sim N(x_{ij}^T\beta_i, \sigma^2)$, and
a-priori, $b_1, \dots, b_p$ are i.i.d.,

\[ f_b(b_\ell) \propto \exp\{-\tilde\lambda\|b_\ell\|_2\}, \quad \ell = 1, \dots, p, \]

where $\tilde\lambda = \lambda/(2\sigma^2)$. We consider now some properties of this prior. For
each $\ell$, $b_\ell$ has a spherically symmetric distribution. In particular, its components are
uncorrelated and have mean 0. However, they are not independent. Change
variables to a polar system, where

\[ R_\ell = \|b_\ell\|_2, \qquad \beta_{i\ell} = R_\ell w_{i\ell}, \quad w_\ell \in S^{n-1}, \]

where $S^{n-1}$ is the unit sphere in $\mathbb{R}^n$. Then, clearly,

\[ f(R_\ell, w_\ell) = C_{n,\tilde\lambda} R_\ell^{n-1} e^{-\tilde\lambda R_\ell}, \quad R_\ell > 0, \qquad (9) \]

where $C_{n,\tilde\lambda} = \tilde\lambda^n\Gamma(n/2)/(2\Gamma(n)\pi^{n/2})$. Thus, $R_\ell$ and $w_\ell$ are independent, $R_\ell \sim
\Gamma(n, \tilde\lambda)$, and $w_\ell$ is uniform over the unit sphere.

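The conditional distribution of one of the coordinates of $b_\ell$, say the first,
given the rest has the form

\[
f\Bigl(b_{\ell 1} \Bigm| b_{\ell 2}, \dots, b_{\ell n}, \sum_{i=2}^n b_{\ell i}^2 = \rho^2\Bigr) \propto e^{-\tilde\lambda\rho\sqrt{1 + b_{\ell 1}^2/\rho^2}},
\]

which for small $b_{\ell 1}/\rho$ looks like the normal density with mean 0 and variance
$\rho/\tilde\lambda$, while for large $b_{\ell 1}/\rho$ it behaves like the exponential distribution with
mean $\tilde\lambda^{-1}$.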
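The two regimes can be checked numerically from the log-density $-\tilde\lambda\rho\sqrt{1 + b^2/\rho^2}$. A second-order expansion gives $-\tilde\lambda\rho - \tilde\lambda b^2/(2\rho)$ for small $b/\rho$, which is a normal log-density with variance $\rho/\tilde\lambda$ up to a constant, while for large $b/\rho$ the expression approaches $-\tilde\lambda|b|$. The function names in this sketch are illustrative only.

```python
import math

def log_conditional(b, rho, lam):
    """Unnormalised log-density of one coordinate given the others."""
    return -lam * rho * math.sqrt(1.0 + (b / rho) ** 2)

def log_normal_regime(b, rho, lam):
    """Second-order expansion, valid for small b/rho (variance rho/lam)."""
    return -lam * rho - lam * b * b / (2.0 * rho)

def log_exponential_regime(b, lam):
    """Leading term for large b/rho (exponential tail with mean 1/lam)."""
    return -lam * abs(b)

small_gap = abs(log_conditional(0.01, 1.0, 2.0) - log_normal_regime(0.01, 1.0, 2.0))
large_ratio = log_conditional(1000.0, 1.0, 2.0) / log_exponential_regime(1000.0, 2.0)
```

Here `small_gap` is of the order of $b^4$ and `large_ratio` is close to 1, in line with the normal-like center and exponential tails described above.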

The sparsity property of the prior comes from the linear component of the
log-density of $R$. If $\tilde\lambda$ is large and the $Y$'s are small, this component dominates
the log-a-posteriori distribution and hence the maximum will be at 0.

Fix now $\ell \in \{1, \dots, p\}$, and consider the estimating equation for $b_\ell$,
the $\ell$-th components of the $\beta$'s. Fix the rest of the parameters and let
$\tilde Y_{ij\ell} = y_{ij} - \sum_{k \ne \ell} \beta_{ik}x_{ijk}$. Then $\hat b_{\ell i}$, $i = 1, \dots, n$, satisfy

\[
0 = -\sum_{j=1}^m x_{ij\ell}(\tilde Y_{ij\ell} - \hat b_{\ell i}x_{ij\ell}) + \frac{\lambda \hat b_{\ell i}}{\sqrt{\sum_k \hat b_{\ell k}^2}}
= -\sum_{j=1}^m x_{ij\ell}(\tilde Y_{ij\ell} - \hat b_{\ell i}x_{ij\ell}) + \lambda^*_\ell \hat b_{\ell i}, \text{ say}, \quad i = 1, \dots, n.
\]

Hence

\[
\hat b_{\ell i} = \frac{\sum_{j=1}^m x_{ij\ell}\tilde Y_{ij\ell}}{\lambda^*_\ell + \sum_{j=1}^m x_{ij\ell}^2}. \qquad (10)
\]

The estimator has an intuitive appeal. It is the least squares estimator of $b_{\ell i}$,
$\sum_{j=1}^m x_{ij\ell}\tilde Y_{ij\ell} / \sum_{j=1}^m x_{ij\ell}^2$, pulled to 0. It is pulled less to zero as the variance
of $b_{\ell 1}, \dots, b_{\ell n}$ increases (and $\lambda^*_\ell$ is getting smaller), and as the variance of
the LS estimator is lower (i.e., when $\sum_{j=1}^m x_{ij\ell}^2$ is larger).

If the design is well balanced, $\sum_{j=1}^m x_{ij\ell}^2 = m$, then we can characterize
the solution as follows. For a fixed $\ell$, $\hat b_{\ell 1}, \dots, \hat b_{\ell n}$ are the least squares solutions
shrunk toward 0 by the same amount, which depends only on the estimated
variance of $b_{\ell 1}, \dots, b_{\ell n}$. In the extreme case, $\hat b_{\ell 1} = \dots = \hat b_{\ell n} = 0$; otherwise
(assuming the error distribution is continuous) they are shrunken toward 0,
but are different from 0.

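We can use (10) to solve for $\lambda^*_\ell$:

\[
\Bigl(\frac{\lambda}{\lambda^*_\ell}\Bigr)^2 = \|\hat b_\ell\|_2^2 = \sum_{i=1}^n \Bigl(\frac{\sum_{j=1}^m x_{ij\ell}\tilde Y_{ij\ell}}{\lambda^*_\ell + \sum_{j=1}^m x_{ij\ell}^2}\Bigr)^2.
\]

Hence $\lambda^*_\ell$ is the solution of

\[
\lambda^2 = \sum_{i=1}^n \Bigl(\frac{\lambda^*_\ell \sum_{j=1}^m x_{ij\ell}\tilde Y_{ij\ell}}{\lambda^*_\ell + \sum_{j=1}^m x_{ij\ell}^2}\Bigr)^2. \qquad (11)
\]

Note that the RHS is monotone increasing in $\lambda^*_\ell$, so (11) has at most a unique
solution. It has no solution if in the limit $\lambda^*_\ell \to \infty$ the RHS is still less than
$\lambda^2$. That is, if

\[
\lambda^2 > \sum_{i=1}^n \Bigl(\sum_{j=1}^m x_{ij\ell}\tilde Y_{ij\ell}\Bigr)^2,
\]

then $\hat b_\ell = 0$. In particular, if

\[
\lambda^2 > \sum_{i=1}^n \Bigl(\sum_{j=1}^m x_{ij\ell}y_{ij}\Bigr)^2, \quad \ell = 1, \dots, p,
\]

then all the random effect vectors are 0. In the balanced case the RHS is
$O_p(mn\log(p))$. By (9), this means that if we want the estimator to
be 0 when the underlying true parameters are 0, then the prior should prescribe
that $b_\ell$ has norm which is $o(m^{-1})$. This conclusion is supported by the
recommended value of $\lambda$ given, e.g., in [10].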
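Equations (10) and (11) suggest a direct way to update one row $b_\ell$ with the other coordinates held fixed: solve (11) for $\lambda^*_\ell$ by bisection, then plug into (10). The sketch below is a minimal illustration in which `a[i]` stands for $\sum_j x_{ij\ell}\tilde Y_{ij\ell}$ and `d[i]` for $\sum_j x_{ij\ell}^2$, assumed positive. The function name and the bisection scheme are our own choices, not the authors' implementation.

```python
def group_lasso_row(a, d, lam, iterations=200):
    """Solve (11) for lambda*_l and return (b_l, lambda*_l) via (10).

    Returns a zero row and None when lam^2 >= sum(a_i^2), i.e. when (11)
    has no solution.
    """
    if lam * lam >= sum(v * v for v in a):
        return [0.0] * len(a), None

    def rhs(ls):
        # Right-hand side of (11), monotone increasing in ls.
        return sum((ls * ai / (ls + di)) ** 2 for ai, di in zip(a, d))

    lo, hi = 0.0, 1.0
    while rhs(hi) < lam * lam:      # bracket the root
        hi *= 2.0
    for _ in range(iterations):     # bisection
        mid = 0.5 * (lo + hi)
        if rhs(mid) < lam * lam:
            lo = mid
        else:
            hi = mid
    ls = 0.5 * (lo + hi)
    return [ai / (ls + di) for ai, di in zip(a, d)], ls
```

For instance, with `a = [3.0, 4.0]`, `d = [1.0, 1.0]` and `lam = 2.5`, the root is $\lambda^*_\ell = 1$ and the row is `[1.5, 2.0]`, whose Euclidean norm equals $\lambda/\lambda^*_\ell = 2.5$ as required.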

Non-asymptotic inequalities and prediction properties of the group lasso
estimators under restricted eigenvalues conditions are given in [10].

4 The RING lasso

The rotation invariant group (RING) lasso is suggested as a natural exten- 
sion of the group lasso to the situation where the proper sparse description 
of the regression function within a given basis is not known in advance. 
For example, when we prefer to leave it a-priori open whether the function 
should be described in terms of the standard Haar wavelet basis, a collection 
of interval indicators, or a collection of step functions. All these three span 
the same linear space, but the true functions may be sparse in only one of 
them. 

4.1 Definition 

Let $A = \sum c_i x_i x_i^T$ be a positive semi-definite matrix, where $x_1, x_2, \dots$ is
an orthonormal basis of eigenvectors. Then, we define $A^\gamma = \sum c_i^\gamma x_i x_i^T$. We
consider now as penalty the function

\[ |||B|||_1 = \mathrm{trace}\Bigl\{\Bigl(\sum_{i=1}^n \beta_i\beta_i^T\Bigr)^{1/2}\Bigr\}, \]

where $B = (\beta_1, \dots, \beta_n) = (b_1^T, \dots, b_p^T)^T$. This is also known as the trace norm
or the Schatten norm with $p = 1$. Note that $|||B|||_1 = \sum c_i^{1/2}$, where $c_1, \dots, c_p$
are the eigenvalues of $BB^T = \sum_{i=1}^n \beta_i\beta_i^T$ (including multiplicities), i.e. this
is the $\ell_1$ norm on the singular values of $B$. $|||B|||_1$ is a convex function of $B$.
In this section we study the estimator defined by

\[
\hat B = \arg\min_{B \in \mathbb{R}^{p \times n}} \Bigl\{ \sum_{i=1}^n \sum_{j=1}^m (y_{ij} - x_{ij}^T\beta_i)^2 + \lambda|||B|||_1 \Bigr\}. \qquad (12)
\]

We refer to this problem as the RING (Rotation INvariant Group) lasso.

The lassoes penalty considered primarily the columns of $B$. The main
focus of the group lasso was the rows. The penalty $|||B|||_1$ is symmetric in its
treatment of the rows and columns since $\mathfrak{S}B = \mathfrak{S}B^T$, where $\mathfrak{S}A$ denotes
the spectrum of $A$. Moreover, the penalty is invariant to the rotation of the
matrix $B$. In fact, $|||B|||_1 = |||TBU|||_1$, where $T$ and $U$ are $n \times n$ and $p \times p$
rotation matrices:

\[ (TBU)^T(TBU) = U^TB^TBU, \]

and the RHS has the same eigenvalues as $B^TB = \sum \beta_i\beta_i^T$.

The rotation-invariant penalty aims at finding a basis in which $\beta_1, \dots, \beta_n$
have the same pattern of sparsity. This is meaningless if $n$ is small: any
function that is well approximated by the span of the basis is sparse under
the right rotation. However, we will argue that this can be done when $n$ is
large.

The following lemma describes a relationship between the group lasso and
the RING lasso.

Lemma 4.1

(i) $\|B\|_{2,1} \ge \inf_{U \in \mathcal{U}} \|UB\|_{2,1} = |||B|||_1$, where $\mathcal{U}$ is the set of all unitary
matrices.

(ii) There is a unitary matrix $U$, which may depend on the data, such that
if $X_1, \dots, X_n$ are rotated by $U^T$, then the solution of the RING lasso
(12) is the solution of the group lasso in this basis.
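Part (i) of the lemma can be illustrated numerically in the smallest non-trivial case, $p = 2$, using only the standard library. For a matrix with two rows, the sum of the singular values has the closed form $\sqrt{\mathrm{tr}(G) + 2\sqrt{\det G}}$ with $G = BB^T$, and every rotation of the rows is indexed by a single angle. The function names and the grid search below are illustrative assumptions, not the paper's method.

```python
import math

def group_norm(B):
    """l_{2,1} norm: sum of the Euclidean norms of the rows of B."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in B)

def trace_norm_two_rows(B):
    """Sum of singular values of a 2 x n matrix via its 2 x 2 Gram matrix."""
    g11 = sum(v * v for v in B[0])
    g22 = sum(v * v for v in B[1])
    g12 = sum(a * b for a, b in zip(B[0], B[1]))
    det = max(g11 * g22 - g12 * g12, 0.0)
    return math.sqrt(g11 + g22 + 2.0 * math.sqrt(det))

def rotate_rows(B, theta):
    """Apply the 2 x 2 rotation by angle theta to the rows of B."""
    c, s = math.cos(theta), math.sin(theta)
    top = [c * a - s * b for a, b in zip(B[0], B[1])]
    bottom = [s * a + c * b for a, b in zip(B[0], B[1])]
    return [top, bottom]

def best_rotated_group_norm(B, steps=2000):
    """Smallest l_{2,1} norm over a grid of row rotations."""
    return min(group_norm(rotate_rows(B, math.pi * k / steps))
               for k in range(steps))
```

On `B = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0]]` the group norm is about 4.69, while both the trace norm and the best rotated group norm are about 4.30, in line with the lemma.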

4.2 The estimator 

Let $B = \sum_{\xi=1}^{p \wedge n} \alpha_\xi \beta^*_\xi b^{*T}_\xi$ be the singular value decomposition, or the PCA, of
$B$: $\beta^*_1, \dots, \beta^*_p$ and $b^*_1, \dots, b^*_p$ are orthonormal sub-bases of $\mathbb{R}^p$ and $\mathbb{R}^n$
respectively, $\alpha_1 \ge \alpha_2 \ge \cdots$, and $BB^T\beta^*_\xi = \alpha_\xi^2\beta^*_\xi$, $B^TBb^*_\xi = \alpha_\xi^2 b^*_\xi$, $\xi = 1, \dots, p \wedge n$.
Let $T = \sum_{\xi=1}^p e_\xi\beta^{*T}_\xi$ (clearly, $TT^T = I$). Consider the parametrization
of the problem in the rotated coordinates, $\tilde x_{ij} = Tx_{ij}$ and $\tilde\beta_i = T\beta_i$.
Then geometrically the regression problem is invariant: $x_{ij}^T\beta_i = \tilde x_{ij}^T\tilde\beta_i$, and
$|||B|||_1 = \|\tilde B\|_{2,1}$, up to a modified regression matrix.

The representation $B = \sum_{\xi=1}^s \alpha_\xi \beta^*_\xi b^{*T}_\xi$ shows that the difficulty of the
problem is the difficulty of estimating $s(n + p)$ parameters with $nm$
observations. Thus it is feasible as long as $s/m \to 0$ and $sp/nm \to 0$.

We have 

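Theorem 4.2 Suppose $p < n$. Then the solution of the RING lasso is
given by $\sum_{\xi=1}^s \alpha_\xi \beta^*_\xi b^{*T}_\xi$, $s = s_\lambda \le p$, and $s_\lambda \searrow$ as $\lambda \to \infty$. If $s = p$ then the
gradient of the target function is given in a matrix form by

\[ -2R + \lambda(\hat B\hat B^T)^{-1/2}\hat B, \]

where

\[ R = \bigl(X_1^T(Y_1 - X_1\hat\beta_1), \dots, X_n^T(Y_n - X_n\hat\beta_n)\bigr). \]

And hence

\[ \hat\beta_i = \Bigl(X_i^TX_i + \frac{\lambda}{2}(\hat B\hat B^T)^{-1/2}\Bigr)^{-1}X_i^TY_i. \]

That is, the solution of a ridge regression with adaptive weight.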

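More generally, let $\hat B = \sum_{\xi=1}^s \alpha_\xi \beta^*_\xi b^{*T}_\xi$, $s < p$, where $\beta^*_1, \dots, \beta^*_p$ is an
orthonormal base of $\mathbb{R}^p$. Then the solution satisfies

\[ \beta^{*T}_\xi R = \frac{\lambda}{2}\beta^{*T}_\xi(\hat B\hat B^T)^{+1/2}\hat B, \quad \xi \le s, \]

\[ |\beta^{*T}_\xi R b^*_\xi| \le \frac{\lambda}{2}, \quad s < \xi \le p, \]

where for any positive semi-definite matrix $A$, $A^{+1/2}$ is the Moore-Penrose
generalized inverse of $A^{1/2}$.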

Roughly speaking, the following can be concluded from the theorem. Suppose
the data were generated by a sparse model (in some basis). Consider the
problem in the transformed basis, and let $S$ be the set of non-zero coefficients
of the true model. Suppose that the design matrix is of full rank within
the sparse model: $X_i^TX_i = O(m)$, and that $\lambda$ is chosen such that $\lambda \gg
\sqrt{nm\log(np)}$. Then the coefficients corresponding to $S$ satisfy

\[ \hat\beta_{Si} = \Bigl(X_i^TX_i + \frac{\lambda}{2}(\hat B_S\hat B_S^T)^{-1/2}\Bigr)^{-1}X_i^TY_i. \]

Since it is expected that $\lambda(\hat B_S\hat B_S^T)^{-1/2}$ is only slightly larger than $O(\sqrt{m\log(np)})$,
it is completely dominated by $X_i^TX_i$, and the estimator of this part of the
model is consistent. On the other hand, the rows of $R$ corresponding to
coefficients not in the true model are only due to noise and hence each of
them is $O(\sqrt{nm})$. The factor of $\log(np)$ ensures that their maximal norm
will be below $\lambda/2$, and the estimator is consistent.



4.3 Bayesian perspectives 

We consider now the penalty for $\beta_k$ for a fixed $k$. Let $A = n^{-1}\sum_{i \ne k} \beta_i\beta_i^T$,
and write the spectral value decomposition $n^{-1}\sum_{i \ne k} \beta_i\beta_i^T = \sum_j c_j x_j x_j^T$,
where $\{x_j\}$ is an orthonormal basis of eigenvectors. Using a Taylor expansion
for not too big $\beta_k$ we get

\[
\mathrm{trace}\bigl((nA + \beta_k\beta_k^T)^{1/2}\bigr) \approx \sqrt{n}\,\mathrm{trace}(A^{1/2}) + \sum_{j=1}^p \frac{x_j^T\beta_k\beta_k^Tx_j}{2\sqrt{n}\,c_j^{1/2}}
\]
\[
= \sqrt{n}\,\mathrm{trace}(A^{1/2}) + \frac{1}{2\sqrt{n}}\beta_k^T\Bigl(\sum_j c_j^{-1/2}x_jx_j^T\Bigr)\beta_k
\]
\[
= \sqrt{n}\,\mathrm{trace}(A^{1/2}) + \frac{1}{2\sqrt{n}}\beta_k^TA^{-1/2}\beta_k.
\]

So, this is like $\beta_k$ having a prior of $N(0, n\sigma^2/\lambda \, A^{1/2})$. Note that the prior is only
related to the estimated variance of $\hat\beta$, and $A$ appears with the power of
1/2. Now $A$ is not really the estimated variance of $\beta$, only the variance of
the estimates, hence it should be inflated, and the square root takes care of
that. Finally, note that eventually, if $\beta_k$ is very large relative to $nA$, then the
penalty becomes $\|\beta\|$, so the "prior" becomes essentially normal, but with
exponential tails.

A better way to look at the penalty from a Bayesian perspective is to
consider it as a prior on the $n \times p$ matrix $B = (\beta_1, \dots, \beta_n)$. Recall that the
penalty is invariant to the rotation of the matrix $B$. In fact, $|||B|||_1 =
|||TBU|||_1$, where $T$ and $U$ are $n \times n$ and $p \times p$ rotation matrices. Now,
this means that if $b_1, \dots, b_p$ is an orthonormal set of eigenvectors of $B^TB$ and
$\gamma_{ij} = b_j^T\beta_i$ is the PCA of $\beta_1, \dots, \beta_n$, then $|||B|||_1 = \sum_{j=1}^p \bigl(\sum_{i=1}^n \gamma_{ij}^2\bigr)^{1/2}$,
the RING lasso penalty in terms of the principal components. The "prior"
is then proportional to $e^{-\tilde\lambda\sum_{j=1}^p \|\gamma_{\cdot j}\|_2}$, which is as if, to obtain a random $B$
from the prior, the following procedure should be followed:

1. Sample $r_1, \dots, r_p$ independently from the $\Gamma(n, \tilde\lambda)$ distribution.

2. For each $j = 1, \dots, p$ sample $\gamma_{1j}, \dots, \gamma_{nj}$ independently and uniformly
on the sphere with radius $r_j$.

3. Sample an orthonormal base $\chi_1, \dots, \chi_p$ "uniformly".

4. Construct $\beta_i = \sum_{k=1}^p \gamma_{ik}\chi_k$.

4.4 Inequalities under an RE condition 

The assumption on the design matrix X needs to be modified to account for 
the search over rotations, in the following way. 

Assumption RE2$(s, c_0, \kappa)$. For some integer $s$ such that $1 \le s \le p$, and a
positive number $c_0$ the following condition holds:

\[
\kappa = \min\Bigl\{ \frac{\|X^T\Delta\|_2}{\sqrt{m}\,\|P_V\Delta\|_2} : V \text{ is a linear subspace of } \mathbb{R}^p, \; \dim(V) \le s, \;
\Delta \in \mathbb{R}^{p \times n} \setminus \{0\}, \; |||(I - P_V)\Delta|||_1 \le c_0|||P_V\Delta|||_1 \Bigr\} > 0,
\]

where $P_V$ is the projection on the linear subspace $V$.

If we restrict the subspaces $V$ to be of the form $V = \bigoplus_{k=1}^r \langle e_{i_k} \rangle$, $r \le s$,
where $\langle e_i \rangle$ is the linear subspace generated by the standard basis vector $e_i$,
and change the Schatten norm to the $\ell_{2,1}$ norm, then we obtain the restricted
eigenvalue assumption RE$_2(s, c_0, \kappa)$ of Lounici et al. [10].

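Theorem 4.3 Let $y_{ij} \sim N(f_{ij}, \sigma^2)$ be independent, $f_{ij} = x_{ij}^T\beta_i$, $x_{ij} \in \mathbb{R}^p$,
$\beta_i \in \mathbb{R}^p$, $i = 1, \dots, n$, $j = 1, \dots, m$, $p \ge 2$. Assume that $\sum_{j=1}^m x_{ij\ell}^2 = m$
for all $i, \ell$. Let assumption RE2$(s, 3, \kappa)$ be satisfied for $X = (x_{ij\ell})$, where
$s = \mathrm{rank}(B)$. Consider the RING lasso estimator $\hat f_{ij} = x_{ij}^T\hat\beta_i$, where $\hat B$ is
defined by (12) with

\[ \lambda = 4\sigma\sqrt{(A + 1)mnp}, \quad \text{for some } A > 1. \]

Then, for large $n$ or $p$, with probability at least $1 - e^{-Anp/8}$,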

mn k z m 

-\\\B-B\\\i < P a ^ T +^ s VP 
n 



rank(B) ^ s — ^™ ax , 

K 

where $\phi_{\max}$ is the maximal eigenvalue of $X^TX/m$.

Thus we have bounds similar to those of the group lasso as a function of the
threshold $\lambda$, with $s$ being the rank of $B$ rather than its sparsity. However,
for the RING lasso we need a larger threshold compared to that of the group
lasso ($\lambda_{GL} = 4\sigma\sqrt{mn}\,(1 + A\log p/\sqrt{n})^{1/2}$, Lounici et al. [10]).
4.5 Persistence 

We discuss now the persistence of the RING lasso estimators (see Section A.l 
for definition and a general result). 

We focus on the sets which are related to the trace norm which defines 
the RING lasso estimator: 

\[ B_{n,p} = \{B \in \mathbb{R}^{n \times p} : |||B|||_1 \le b(n, p)\}. \]
Theorem 4.4 Assume that n > 1. For any F G J^^piV), G B n ^ p and 

p(m,n, P ) = ai . gmin ^( /3 ) ) 
Figure 1: Component variances and eigenvalues, m = 25, n = 150



we have 

L F 0) - min L F (j3) < f— + — ^ ( ISeV^T) ^ 

with probability at least 1 — 7?, for any rj G (0, 1). 

Thus, for $\eta$ sufficiently small, the conditions $\log(np) \le c_p m^3\eta$ and $b \le c_b\sqrt{nm/p}$, for some $c_b, c_p > 0$, imply that with sufficiently high probability, the estimator is persistent. Roughly speaking, $b$ is the number of components in the SVD of $B$ (the rank of $B$, $\mathcal{M}(\beta)$ after the proper rotation), and if $m \gg \log n$, then what is needed is that this number be strictly less than $n^{1/2}m^{3/4}p^{-1/2}$. If the true model is sparse, $p$ can be almost as large as $m^{3/2}n^{1/2}$.



4.6 Algorithm and small simulation study 

A simple algorithm is the following: 

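1. Initiate some small value of $\hat\beta_1,\dots,\hat\beta_n$. Let $A = \sum_{i=1}^n \hat\beta_i\hat\beta_i^T$. Fix $\gamma\in(0,1]$, $\varepsilon>0$, $k$, and $c>1$.

2. For $i = 1,\dots,n$:

(a) Compute $\delta_i = (X_i^TX_i + \lambda A^{-1/2})^{-1}X_i^T(y_i - X_i\hat\beta_i)$.

(b) Update $A \leftarrow A - \hat\beta_i\hat\beta_i^T$; $\hat\beta_i \leftarrow \hat\beta_i + \gamma\delta_i$; $A \leftarrow A + \hat\beta_i\hat\beta_i^T$.

3. If $\sum_{j=1}^{p}\mathbf{1}\bigl(\sum_{i=1}^{n}\hat\beta_{ij}^2 > \varepsilon\bigr) > k$, update $\lambda \leftarrow \lambda c$; otherwise $\lambda \leftarrow \lambda/c$.

4. Return to step 2 unless there is no real change of coefficients.

Figure 2: Lower lip position while repeating 32 times 'Say bob again'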

To speed up the computation, the SVD was computed only every 10 values of $i$.
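
For readers who want something runnable, the trace-norm penalised least squares problem can also be minimised by a generic proximal-gradient iteration with singular value soft-thresholding. The sketch below is not the algorithm listed above: it is a standard substitute technique, and the function names (`svt`, `ring_lasso_prox`, `objective`), the array layout (designs stacked as an `(n, m, p)` array with rows of `B` equal to the $\beta_i$), the fixed $\lambda$ and the step-size rule are all assumptions made for illustration.

```python
import numpy as np

def svt(B, tau):
    """Soft-threshold the singular values of B by tau (prox of tau * trace norm)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(X, Y, B, lam):
    """sum_i ||Y_i - X_i b_i||^2 + lam * (sum of singular values of B)."""
    resid = np.einsum("imp,ip->im", X, B) - Y
    return np.sum(resid ** 2) + lam * np.linalg.svd(B, compute_uv=False).sum()

def ring_lasso_prox(X, Y, lam, n_iter=500):
    """Proximal gradient for the trace-norm penalised problem (illustrative sketch).

    X: (n, m, p) design arrays, Y: (n, m) responses.
    Returns B of shape (n, p) whose rows are the estimated coefficient vectors.
    """
    n, m, p = X.shape
    # Lipschitz constant of the gradient: 2 * max_i (largest singular value of X_i)^2.
    L = 2.0 * max(np.linalg.norm(X[i], 2) ** 2 for i in range(n))
    B = np.zeros((n, p))
    for _ in range(n_iter):
        resid = np.einsum("imp,ip->im", X, B) - Y        # X_i b_i - Y_i
        grad = 2.0 * np.einsum("imp,im->ip", X, resid)   # 2 X_i^T (X_i b_i - Y_i)
        B = svt(B - grad / L, lam / L)                   # gradient step, then prox
    return B
```

With step size $1/L$ the objective is non-increasing along the iterations, which makes the routine easy to sanity-check on small simulated problems before comparing it with the coordinate-wise updates above.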

As a simulation we applied the above algorithm to the following simulated data. We generated random $\beta_1,\dots,\beta_{150}\in\mathbb{R}^{150}$ such that all coordinates are independent, and $\beta_{ij}\sim\mathcal{N}(0, e^{-2j/5})$. All $X_{ij\ell}$ are i.i.d. $\mathcal{N}(0,1)$, and $y_{ij} = x_{ij}^T\beta_i + \varepsilon_{ij}$, where $\varepsilon_{ij}$ are all i.i.d. $\mathcal{N}(0,1)$. The true $R^2$ obtained was approximately 0.73. The number of replicates per value of $\beta$, $m$, varied between 5 and 300. We consider two measures of estimation error:

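$$L_{par} = \frac{\sum_{i=1}^n\|\hat\beta_i-\beta_i\|_\infty}{\sum_{i=1}^n\|\beta_i\|_\infty}, \qquad L_{pre} = \frac{\sum_{i=1}^n\|X_i(\hat\beta_i-\beta_i)\|_\infty}{\sum_{i=1}^n\|X_i\beta_i\|_\infty}.$$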
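
The simulation design and the two error measures can be sketched in a few lines. This is only an illustration under stated assumptions: the coordinate index $j$ is taken to start at 1, the designs are stored as an `(n, m, p)` array, and the helper names `simulate`, `l_par` and `l_pre` are hypothetical, not taken from the original study.

```python
import numpy as np

def simulate(n=150, p=150, m=25, seed=0):
    """Draw beta_ij ~ N(0, exp(-2j/5)), X ~ N(0, 1), y = X beta + N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    j = np.arange(1, p + 1)                  # assumption: coordinates indexed from 1
    sd = np.exp(-j / 5.0)                    # standard deviation, so variance is exp(-2j/5)
    beta = rng.standard_normal((n, p)) * sd
    X = rng.standard_normal((n, m, p))
    y = np.einsum("imp,ip->im", X, beta) + rng.standard_normal((n, m))
    return X, y, beta

def l_par(beta_hat, beta):
    """Relative sup-norm error of the coefficients, summed over the n problems."""
    return np.abs(beta_hat - beta).max(axis=1).sum() / np.abs(beta).max(axis=1).sum()

def l_pre(X, beta_hat, beta):
    """Relative sup-norm error of the fitted values, summed over the n problems."""
    err = np.einsum("imp,ip->im", X, beta_hat - beta)
    fit = np.einsum("imp,ip->im", X, beta)
    return np.abs(err).max(axis=1).sum() / np.abs(fit).max(axis=1).sum()
```

Both measures equal 0 for a perfect estimate and 1 for the trivial estimate $\hat\beta_i = 0$, which gives a natural scale for reading the reported values.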

The algorithm stopped after 30-50 iterations. Figure is a graphical presentation of a typical result. A summary is given in Table 1. Note that $m$ has a critical impact on the estimation problem. However, with as little as 5 observations per $\mathbb{R}^{150}$ vector of parameters we obtain a significant reduction in the prediction error.

    m      L_par              L_pre
    5      0.9530 (0.0075)    0.7349 (0.0375)
    25     0.7085 (0.0289)    0.7364 (0.0238)
    300    0.2470 (0.0080)    0.5207 (0.0179)

Table 1: The estimation and prediction error as a function of the number of observations per vector of parameters. Means (and SD).

The technique is natural for functional data analysis. We used the data LipPos. The data are described by Ramsay and Silverman and can be found at http://www.stats.ox.ac.uk/~silverma/fdacasebook/lipemg.html. The original data are given in Figure 2. However, we added noise to the data, as can be seen in Figure 3. The lip position is measured at m = 501 time points, with n = 32 repetitions.

As the matrix X we considered the union of 6 cubic spline bases with, respectively, 5, 10, 20, 100, 200, and 500 knots (i.e., p = 841, and $X_i$ does not depend on $i$). A Gaussian noise with $\sigma = 0.001$ was added to Y. The result of the analysis is given in Figure 3. Figure 4 presents the projection of the mean path on the first eigenvectors of $\sum_{i=1}^n\hat\beta_i\hat\beta_i^T$.

The final example we consider is somewhat arbitrary. The data, taken from StatLib, are the daily wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland. As the Y variable we considered one of the stations (station BIR). As explanatory variables we considered the 11 other stations on the same day, plus all 12 stations 70 days back (with the constant we have altogether 852 explanatory variables). The analysis was stratified by month. For simplicity, only the first 28 days of each month were taken, and the first year, 1961, served only for explanatory purposes. The last year served only for testing, so the training set covered 16 years (n = 12, m = 448, and p = 852). In Figure 5 we give the 2nd moments of the coefficients and the scatter plot of predictions vs. true value of the last year.
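
The count of 852 explanatory variables (1 constant + 11 same-day stations + 12 stations x 70 lags) can be made concrete with a small sketch of how such a lagged design matrix might be assembled. The function name `lagged_design`, the `(T, S)` array layout and the column ordering are assumptions for illustration; the stratification by month and the train/test split described above are not reproduced here.

```python
import numpy as np

def lagged_design(speeds, target, n_lags=70):
    """Build a lagged regression design from daily station data.

    speeds: (T, S) array, one row per day and one column per station.
    target: column index of the response station.
    Returns X of shape (T - n_lags, 1 + (S - 1) + S * n_lags) and y.
    """
    T, S = speeds.shape
    others = [s for s in range(S) if s != target]
    rows = []
    for t in range(n_lags, T):
        same_day = speeds[t, others]                      # other stations, same day
        history = speeds[t - n_lags:t, :][::-1].ravel()   # all stations, lags 1..n_lags
        rows.append(np.concatenate(([1.0], same_day, history)))
    return np.asarray(rows), speeds[n_lags:, target]
```

With S = 12 stations and 70 lags, each row has 1 + 11 + 840 = 852 entries, matching the value of p quoted in the text.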
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Figure 3: Eigenvalue, coefficient variance and typical observed and smooth 
path. 

A Appendix 

A.1 General persistence result.

A sequence of estimators $\hat\beta^{(m,n,p)}$ is persistent with respect to a set of distributions $\mathcal{F}^m_{n,p}$ for $\beta\in\mathcal{B}_{n,p}$, if for any $F_{m,n,p}\in\mathcal{F}^m_{n,p}$,

$$L_{F_{m,n,p}}(\hat\beta^{(m,n,p)}) - L_{F_{m,n,p}}(\beta^*_{F_{m,n,p}}) \xrightarrow{P} 0,$$

where $L_F(\beta) = (nm)^{-1}E_F\sum_{i=1}^n\sum_{j=1}^m (Y_{ij} - x_{ij}^T\beta_i)^2$, and $\hat F_{m,n,p}$ is the empirical distribution function of the $n\times(p+1)$ matrix $Z$, $Z_i = (Y_i, X_{i1},\dots,X_{ip})$, $i=1,\dots,n$, observed $m$ times. Here $\beta^*_F = \arg\min_{\beta\in\mathcal{B}_{n,p}} L_F(\beta)$ and $\mathcal{F}^m_{n,p}$ stands for a collection of distributions of $m$ observations of vectors $Z_i = (Y_i, X_{i1},\dots,X_{ip})$, $i=1,\dots,n$.

Figure 4: Projection of the estimated mean path on the 2 first eigenvectors of $\sum_{i=1}^n\hat\beta_i\hat\beta_i^T$ and the true mean path.

Figure 5: Coefficient 2nd moment and prediction vs. true value of the test year.

Assumption F. Under the distributions of random variables $Z$ in $\mathcal{F}_{n,p}$, $\xi_{ijk} = Z_{ij}Z_{ik}$ satisfy $E\bigl(\max_{i=1,\dots,n}\max_{j,k=1,\dots,p+1}\xi_{ijk}^2\bigr) < V$. Denote this set of distributions by $\mathcal{F}_{n,p}(V)$.

This assumption is similar to one of the assumptions of Greenshtein and Ritov (2004). It is satisfied if, for instance, the distribution of $Z_{ij}$ has finite support and the variance of $Z_{ij}Z_{ik}$ is finite.

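Lemma A.1 Let $F\in\mathcal{F}_{n,p}(V)$, and denote $\Sigma_i = (\sigma_{ijk})$ and $\hat\Sigma_i = (\hat\sigma_{ik\ell})$, with $\sigma_{ijk} = E_FZ_{ij}Z_{ik}$ and $\hat\sigma_{ik\ell} = m^{-1}\sum_{j=1}^m Z^{(j)}_{ik}Z^{(j)}_{i\ell}$, where $Z = (Z^{(j)}_{i\ell})$ is a sample from $F^m$, $i=1,\dots,n$, $j=1,\dots,m$, $\ell=1,\dots,p$.

Let $\hat\beta$ be the estimator minimising $\sum_{i=1}^n\sum_{j=1}^m(Y_{ij} - X_{ij}^T\beta_i)^2$ subject to $\beta\in\mathcal{B}$, where $\mathcal{B}$ is some subset of $\mathbb{R}^{n\times p}$.

Then, for any $\eta\in(0,1)$,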



(a) max — Sj| 



'2eVlog(ra(p + l) 2 ) 



=i,...,n y mrj 
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1=1 



nm v mrj 
with probability at least 1 — rj. 

Proof. Follows that of Theorem 1 in Greenshtein and Ritov (2004).

a) Let $\hat\sigma_{ik\ell} = \sigma_{ik\ell} + \varepsilon_{ik\ell}$, $E_i = (\varepsilon_{ik\ell})$. Then, under Assumption F and by Nemirovsky's inequality (see e.g. Lounici et al. [10]),

P(max - Sj||oo > -4) 

i 



2elog(w(p+ l) 2 ) 2 , 

E( max max (ZijZ ik - E{ZijZ ik )) 

mA 2 i=i,...,nj,k=i,... P +i J 

2eV \og(n(p + I) 2 ) 



rn 



A 2 



Taking A = y 2e ^iog(n^(p+i) ) p roves ^he g rs ^ p ar t Q f the lemma, 
b) By the definition of j3 and f3* F , 

Ef0) ~ Lp{P* F ) ^ 0, Lp0) — Lp(fi* F ) ^ 0. 

Hence, 



< Lp (£j - L F (/3» = L F (0j - L # (0 
+ ($) - L F ($) + L F (?) - Lp (/3* F ) 



^2 sup \L F (J3)-L P 



Denote 5j = (-1, ^i, . . . , then 

n 

L F (/3) = V^S^ 



nm 
i=l 



where £j?j = (o"j,fc) and Ujjfc = EpZ^Z^- For the empirical distribution 
function F mn determined by a sample ZjjP, i = l,...,n, j = l,...,m, 
£ = 1, . . . ,p, = (<7 ijW ) and <r^ = i ^JLi z \k Z i°i ■ 

Introduce matrix £ with = A. Hence, with probability at least 1 — 77, 
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\L F {fi)-L p 





1 




nm 




1 , 


< 




nm ■ 



8=1 



1=1 



nm y mry 



A.2 Proofs of Section 2

Proof of Theorem 2.1. Note that by the definition of /3j and (5). 

n 

mnc„ + X n /, II A Hi 
i=l 

n n 

i=i i=i 
n n 
< m ^ + (A„ + m5 n ) ^ \\k\i 

i=l i=l 

n n n 

<mJ2 PlSifa + An Yl II Aolli + m5 n £ || A ||? 



i=l 
n 



i=l 



i=l 



< m 



^ A T o^Ao + (An + mtn) HAolli + ^ H/3,11? 



8=1 



i=l 



mnC„ + (A„ + m(? w ) y^HAolli + m8 n } ^ 



i=i 

2 



i=l 



i=l 



Comparing the LHS with the RHS of (13), noting that mb n <C A n : 

Ell a II 2 ^ C n — C n A n + m8 n sr~^ .. ~ ||2 



i=i 



8=1 
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By (5) and (6): 

E k^ik < e A T <%A + s n E ii Ail? 

i=l i=l i=l 

n , n , n n 

< ]r + ^ E n^oii? - - E hAii? + E hAii? 

j=l i=l i=l i=l 

n , n , n 

< £ ftj&ft, + + 5„) E IIAoll! - (— - *n) E HAlll 
i=l i=l i=l 

n , n 

<E^% + (—+s n ) Enroll?- 

j=l i=l 

(14) 

The result follows. □ 

Proof of Theorem 2.2. The proof is similar to the proof of Theorem 2.1. 
Similar to (13) we obtain: 



n 

a 



lCn + A n EllAlll 
i=l 

n n 

<m^fi%k + K^\\k\\i 
i=i i=i 
n n n 

< m ksik + \nYl ii An? + m5 « E ii An? 

i=l i=l i=l 

n n n 

< m PlSifa + A n E ll&olli + mS n E II All? 

i=l i=l i=l 

n n n n 

< m^^SiAo + A n El^o||? + m5 n EH^oll? + ^EllAll? 

i=l i=l i=l i=l 

n n n 

= mnc n + A„E IIAolli + mSn^ IIAolli + m8 n ^ \\/3j ||? . 



i=l i=l i=l 

That is, 

n n n 

E(Anii^iii - m<y An?) < An E iiAoiif + m5 « E ii^oiii 

i=l i=l i=l 

= Q(mn5 n ). 
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It is easy to see that the maximum of Y2i=i II A 111 subject to the constraint 

(16) is achieved when = ••• = ||/3 n ||f. That is when ||/?i||f solves 

X n u a — m5 n u 2 = Q(m5 n ). As A n = Q(m5 m ), the solution satisfies u = 
0(mS n /X n ) l ^ 2 \ 

Hence we can conclude from (16) 



Y,m\\ 2 2 = 0(n(m6 n /\ n ) 2 ^ a -V) 

i=l 

We now proceed similar to (14) 

n - _ n n 

Y,M%h < ^2~^Si^+s n ^2\\k\i 

i=l i=l i=l 

n , n , n n 

< £ fcSifa + ^ J2 iiAoiif - - E ii Ail? + 

i=l i=l i=l i=l 

n . n n n 

< Pm%Pm + E H^o II? + $n II Aolli + 5 n E 

i=l i=l i=l i=l 



Mi 



i=i 

since A n = Q(m8. 



□ 



Proof of Remark 2.1. If m8 m /X = o(l), then, following the proof of Theo- 
rem 2.2, the solution maximising X^=i II A 111 subject to the constraint (16) 
satisfies ||/3j||i = 0(1), and hence we have 

n _ n 

^Eift < £ A^fto + P (nX n /m + n5 n ) . 

i=l i=l 

□ 

Proof of Theorem 2.3. The proof follows that of Lemma 3.1 in Lounici et 
al. [10]. 

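We start with (a) and (b). Since $\hat\beta$ minimizes (6), then, $\forall\beta$,

$$\sum_{i=1}^n\|Y_i - X_i^T\hat\beta_i\|_2^2 + \lambda\sum_{i=1}^n\|\hat\beta_i\|_1^\alpha \le \sum_{i=1}^n\|Y_i - X_i^T\beta_i\|_2^2 + \lambda\sum_{i=1}^n\|\beta_i\|_1^\alpha,$$

and hence, for $Y_i = X_i^T\beta_i + \varepsilon_i$,

$$\sum_{i=1}^n\|X_i^T(\hat\beta_i - \beta_i)\|_2^2 \le \sum_{i=1}^n\Bigl[2\varepsilon_i^TX_i^T(\hat\beta_i - \beta_i) + \lambda\bigl(\|\beta_i\|_1^\alpha - \|\hat\beta_i\|_1^\alpha\bigr)\Bigr].$$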

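Denote $V_{i\ell} = \sum_{j=1}^m x_{ij\ell}\varepsilon_{ij}\sim\mathcal{N}(0, m\sigma^2)$, and introduce the event $\mathcal{A}_i = \bigcap_{\ell=1}^p\{|V_{i\ell}|\le\mu\}$, for some $\mu>0$. Then

$$P(\mathcal{A}_i^c) \le \sum_{\ell=1}^p P(|V_{i\ell}|>\mu) = \sum_{\ell=1}^p 2\bigl[1-\Phi\bigl(\mu/(\sigma\sqrt{m})\bigr)\bigr] \le p\exp\{-\mu^2/(2m\sigma^2)\}.$$

For $\mathcal{A} = \bigcap_{i=1}^n\mathcal{A}_i$, by the union bound,

$$P(\mathcal{A}^c) \le \sum_{i=1}^n P(\mathcal{A}_i^c) \le pn\exp\{-\mu^2/(2m\sigma^2)\}.$$

Thus, if $\mu$ is large enough, $P(\mathcal{A}^c)$ is small, e.g., for $\mu = \sigma A\bigl(m\log(np)\bigr)^{1/2}$, $A > \sqrt{2}$, we have $P(\mathcal{A}^c)\le(np)^{1-A^2/2}$.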
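
The last claim can be checked by direct substitution. The short computation below uses only the stated choice $\mu = \sigma A(m\log(np))^{1/2}$ and the bound $pn\exp\{-\mu^2/(2m\sigma^2)\}$:

```latex
\frac{\mu^2}{2m\sigma^2}
  = \frac{\sigma^2 A^2\, m\log(np)}{2m\sigma^2}
  = \frac{A^2}{2}\log(np),
\qquad
np\,\exp\Bigl\{-\frac{A^2}{2}\log(np)\Bigr\}
  = np\,(np)^{-A^2/2}
  = (np)^{1-A^2/2}.
```

The exponent $1 - A^2/2$ is negative exactly when $A > \sqrt{2}$, which is why that condition makes the probability of the complement event vanish as $np$ grows.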
On event .A, for some v > 0, 

E[ll^(ft-A)lll + 

t=i 

E[2/i||A-ft||i + A(||A|| 2 



i 1 



i=l 
?i m 



= EE [«Amax(||A|irMlAlli _1 )(l^|- 14-1) + (^ + 2m)|^-4'I 
1=1 j=l 

n m 

^ EE [«Amax(B a - 1 , J B Q - 1 )(|A i | - |/%|) + (y + 2^ - fa\ 

i=l j=l 

due to the inequality $|x^\alpha - y^\alpha| \le \alpha|x-y|\max(|x|^{\alpha-1},|y|^{\alpha-1})$, which holds for $\alpha\ge1$ and any nonnegative $x$ and $y$. To simplify the notation, denote $C = \alpha\max(B^{\alpha-1},\hat B^{\alpha-1})$.

Denote J { = Jfjfy = {j : fa ^ 0}, M(pi) = \J(Pi)\. For each i and 
j G J {^i), the expression in square brackets is bounded above by 

[XC + u + 2n\\fa-fa\, 
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and for j £ </ c (/3), the expression in square brackets is bounded above by 0, 
as long as v + 2/x ^ AC: 

-XC\p ij \ + (u + 2n)\Pij\ <0. 

This condition is satisfied if z/ + 2/j, ^ AC. 
Hence, on „4, for u + 2/x ^ AC, 



n n 

^[||x7(A-ft)||| + HIA-ft||i] ^^[AC + 2 M + y]||(ft-A) 



i=l 

This implies that 

n 



i=l 



^11^(^-^)112 < [AC + z, + 2 / u]||(/3-/3) 



as well as that 



ll/3-/3|li« 



Z/ ^ 



Take z/ = AC/2, hence we need to assume that 2fi AC/2: 

3A 



£p?(A-A)ll!j< 

i=l 

which implies 



-C + 2^ 



11(0-0) 



3 + 



4/x 
AC 



(17) 



\\(P-P)j\\i<4:\\(J3-P)j\ 



Due to the generalized restricted eigenvalue assumption REi(s,3, k), 
\X T (P - $)\\ 2 > Ky/m\\(P - P)j\\ 2 , and hence, using (17), 



\X T ((3-P)\\U 



3A„ 



'nM(/3)||(/3-/3) 



J||2 



K\/m 



where M(P) = max, M.(Pi), implying that 



|AC T (/3-/?)|| 2 ^ 



3A 



C + 2/i 



K\/m 
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K\/m 



3^ 

—C + 2Aa^fm\og(np) 



Also, 



|| / 3-4|| 1 <4||C9-4) J ||i<4^M||jr T G8 



IT1K 



4n.M(/3) 



3^ 

— C + 2^4(7^ rai log(rcp) 



Hence, a) and b) of the theorem are proved, 
(c) For i, I: fin ^ 0, we have 



2X i . e (Y l - Xjfr) = Aasgn CM\\Mi~ 1 , 



Hence, 



||X w x7(ft-A)||^ (^XuiYi-Xj^-UuiXi-Xjfii)^ 

> (aA||&||r72-A*) a 
= X(/3 l )(aA||ft||r 1 /2-/u) 2 . 

Thus, 

-M(A)<l|Jfi(A-A)lll- 72 

AallAlirVZ-A* 



Theorem is proved. 



□ 



Proof of Theorem 2.4- To satisfy the conditions of Theorem 2.3, we can take 
B = b and A = -^^/m log (rap). 



Thus, by Lemma A.l, 



A AAo /log (rap) 



m rj 



C^K <: Ci, 



m5 n ab - 1 ^ m \j 2eV log(n(p + l) 2 ) afr"- 1 

hence assumption A = Q(m5 n ) of Theorem 2.2 is satisfied. 
Hence, from the proof of Theorem 2.3, it follows that 



| 1 = 0((m«5 n /An) 1/{ ^ 2) ) =0 



b a-l^l/(a-2) > 
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Hence, we can take B = b and B = C [ — — ) for some C > 0, 



Vv 

and apply Theorem 2.3. Then max(l, B/B) is bounded by 



max 



1,C 



& (o-l)/(o-2)- 



r} l/{2{a-2)) 



max 



1,C 



& l/(a-2) 
^1/(2(0.-2)) 



- 



Hence, 



e 77= ^ C 2 



^l/(2(c-l)) 



^ C 2 V~ {a ~ 2)/(2ia ~ 1)) is large for small 



3aA 
2 x /m 



max(S a - 1 , S " 1 ) + 2Aa y / log(np) 



< 6ACo-A/log(np) 



+ 2Aa v / log(np) 



2 At y log (np) 



& (a-l)/(o-2) 
?? (a-l)/(2(a-2)) 

h v (a-l)/(o-2) 

1-7=] 



and, applying Theorem 2.3, we obtain (a) and (b). 
c) Apply c) in Theorem 2.3, summing over i £ X: 



^A^(ft)<ll^ T (/5 



2 mc Pmax 



< 



4sn4> n 



k 2 5 2 



1 + 3C 



(a-l)/(a-2)' 



□

A.3 Proofs of Section 4



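Proof of Lemma 4.1. Let $B = \sum_{\ell=1}^k\alpha_\ell\beta^*_\ell b^{*T}_\ell$ be the spectral decomposition of $B$, where $\beta^*_1,\dots,\beta^*_k$ are orthonormal $\mathbb{R}^p$ vectors, $b^*_1,\dots,b^*_k$ are orthonormal $\mathbb{R}^n$ vectors, $\alpha_1,\dots,\alpha_k\ge0$, and $k=\min\{p,n\}$. Clearly $|||B|||_1 = \sum_{\ell=1}^k\alpha_\ell$. Let $U = \sum_{\ell=1}^k e_\ell\beta^{*T}_\ell$ where $e_1,\dots,e_p$ is the natural basis of $\mathbb{R}^p$. Then

$$\|UB\|_{2,1} = \Bigl\|\sum_{\ell=1}^k\alpha_\ell e_\ell b^{*T}_\ell\Bigr\|_{2,1} = \sum_{\ell=1}^k\alpha_\ell = |||B|||_1.$$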
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Let B = Yl^=i e ^J where bi, 62, ■ ■ ■ , bk are orthogonal, and let U be a 
unitary matrix. Then by Schwarz inequality 



\B\ 



2.1 



Eini 

P V 

EE^ 

t=i j=i 



since 




>E^ 

i=l 

by Schwarz inequality 



since 



i=i 



which completes the proof of the (i). 
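
Part (i) is easy to check numerically: rotating by the left singular vectors makes the $\ell_{2,1}$ norm equal to the trace norm, while any other rotation can only give a larger (or equal) value. The sketch below is an illustration only; the dimensions, the random seed and the helper names `trace_norm` and `l21_norm` are arbitrary choices, and $B$ is stored as a $p\times n$ matrix whose columns are the $\beta_i$.

```python
import numpy as np

def trace_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()

def l21_norm(B):
    """Sum of the Euclidean norms of the rows of B."""
    return np.linalg.norm(B, axis=1).sum()

rng = np.random.default_rng(0)
p, n = 6, 10
B = rng.standard_normal((p, 3)) @ rng.standard_normal((3, n))   # p x n, rank 3

# Rotation whose rows are the left singular vectors of B.
U = np.linalg.svd(B, full_matrices=True)[0].T
aligned = l21_norm(U @ B)

# A random orthogonal matrix (Q factor of a Gaussian matrix) for comparison.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
random_rot = l21_norm(Q @ B)
```

Here `aligned` agrees with `trace_norm(B)` up to rounding, and `random_rot` is never smaller, in line with the two inequalities used in the proof.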

Now, consider the U defined as above for the solution of (12). Let X{ be 
the design matrices B be the solution expressed in this basis. By the first 
part of the lemma |||jS|||i = ||i3||2,i- Suppose there is a matrix B 7^ B which 
minimizes the group lasso penalty. Hence 



Y,\\^-x i f3 l \\ 2 + 



\B\ 



< 



i=i 



i=l 
n 

<Eii^ 

i=l 
n 

i=l 



+ A||£|| 2 ,i 
2 + A||£|| 2 ,i 



+ 



\B\ 



1- 



contradiction since B minimized (12). Part (ii) is proved. 



□ 



Proof of Theorem 1^.2 . Let A = Y17=i PiPj = &B T be of rank s ^ p < n, 
and hence the spectral decomposition of B can be written as B = ^£=1 a £,@l ^J T ) 
where /3*, . . . 6K P are orthonormal, and so are b\, . . . , b* S M n . Hence, 
the rotation U leading to a sparse representation UB (with s non-zero rows) 
is given by U = 5Z|=i e £/3| T ; where ei, . . . , e p is the natural basis of W. An- 
other way to write the rotation matrix is U = (fii T , . . . , /3^ T , T , . . . , T ) T . 
Denote by $U_s$ the non-zero $s\times p$-dimensional submatrix $(\beta^{*T}_1,\dots,\beta^{*T}_s)^T$.

Let $A(t) = A + t(\beta\beta_i^T + \beta_i\beta^T) + t^2\beta\beta^T$ for some fixed $i$, with $\beta\in\mathrm{span}\{\beta_1,\dots,\beta_n\} = \mathrm{span}\{\beta^*_1,\dots,\beta^*_s\}$.

If $(x_k(t), c_k(t))$ is an eigen-pair of $A(t)$, then taking the derivative of $x_k^Tx_k = 1$ yields $x_k^T\dot x_k = 0$, and trivially, since $x_k$ is an eigenvector, also $x_k^TA\dot x_k = 0$. Here a single and a double dot denote the first and second derivative, respectively, with respect to $t$. Also, we have

$$x_k(t) = x_k + tu_k + o(t), \qquad c_k(t) = c_k + t\nu_k + o(t)$$

and

$$\bigl(A + t(\beta\beta_i^T + \beta_i\beta^T)\bigr)(x_k + tu_k) = (c_k + t\nu_k)(x_k + tu_k) + o(t),$$

where $u_k\perp x_k$.

Equating the $O(t)$ terms we obtain

$$Au_k + (\beta\beta_i^T + \beta_i\beta^T)x_k = c_ku_k + \nu_kx_k.$$

Take now the inner product of both sides with $x_k$ to obtain that

$$\nu_k = 2(\beta^Tx_k)(x_k^T\beta_i). \qquad (18)$$

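Note that the null space of $A(t)$ does not depend on $t$. Hence, if we call $\psi(B) = |||B|||_1$,

$$\frac{\partial}{\partial t}\psi(B(t))\Big|_{t=0} = \sum_{c_k>0}\frac{\partial}{\partial t}c_k^{1/2}(t)\Big|_{t=0} = \frac12\sum_{c_k>0}\frac{\nu_k}{c_k^{1/2}} = \sum_{c_k>0}\frac{(\beta^Tx_k)(x_k^T\beta_i)}{c_k^{1/2}}$$

$$= \beta^TA^{+1/2}\beta_i = \beta^T(BB^T)^{+1/2}\beta_i = \beta^TU_s^T(U_sBB^TU_s^T)^{-1/2}U_s\beta_i,$$

where $A^{+1/2}$ is the generalized inverse of $A^{1/2}$.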
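
The directional derivative of the trace norm can be sanity-checked with a finite difference. The sketch below is an illustration under stated assumptions: $B$ is stored as a $p\times n$ matrix with columns $\beta_i$, the direction $\beta$ is drawn from the span of those columns as the argument requires, and the helper names `trace_norm` and `pinv_sqrt` are hypothetical.

```python
import numpy as np

def trace_norm(B):
    """Sum of the singular values of B."""
    return np.linalg.svd(B, compute_uv=False).sum()

def pinv_sqrt(A, tol=1e-10):
    """Generalized inverse of A^{1/2} for a symmetric positive semi-definite A."""
    c, x = np.linalg.eigh(A)
    keep = c > tol * c.max()                 # drop the (numerical) null space
    return (x[:, keep] / np.sqrt(c[keep])) @ x[:, keep].T

rng = np.random.default_rng(1)
p, n, s, i = 5, 8, 3, 2
B = rng.standard_normal((p, s)) @ rng.standard_normal((s, n))   # rank s, columns beta_1..beta_n
beta = B @ rng.standard_normal(n)                               # direction in span{beta_i}

# Closed form: beta^T (B B^T)^{+1/2} beta_i.
analytic = beta @ pinv_sqrt(B @ B.T) @ B[:, i]

# Central finite difference when beta_i is replaced by beta_i + t * beta.
t = 1e-6
E = np.zeros_like(B)
E[:, i] = beta
numeric = (trace_norm(B + t * E) - trace_norm(B - t * E)) / (2 * t)
```

Because the perturbation stays inside the column span of $B$, the non-zero singular values move smoothly and the two numbers agree to several digits.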

Taking, therefore, the derivative of the target function with respect to 
fti in the directions of j3 € span{/3i, . . . ,/3 n } (e.g., in the directions j3 = 
£ = 1, . . . ,s) gives 

= m T (-2Xj(Yi - XiPi) + X(BB T ) +1 / 2 pi), or, equivalent^, 
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= U s (2Xj(Y i - Xifc) - \{BB T ) +1 ' 2 h)- 
Let R = (n, . . . , r p ) T be the matrix of projected residuals: 

m 
3=1 

Then 

UsR=^Us(BB J ) +1 ' 2 B. 

Consider again the general expansion B = £)f=i b| T . Then 1 1 \B\ \ | i = 
=1 I a d • Taking the derivative of the sum of squares part of the target 
function with respect to a% we get 

n 

£ bliPfxJ(Yi - xSi) = PfRh\. 

i=l 

Considering the sub-gradient of the target function we obtain that ^ Rb\\ < 
A/2, and ot£ = in case of strict inequality. 

□ 

Proof of Theorem 4-3 ■ (a) and (b) Similarly to the proof of Theorem 2.3, 
we have 

|| Y - X T £||2 = || Y - X T B\\ 2 + 2^ % -xT.(A - fa). 

The last term can be bounded with high probability. Introduce the matrix $M$ with independent columns $M_i = X_i\varepsilon_i\sim\mathcal{N}_p(0, m\sigma^2I_p)$, $i=1,\dots,n$, since $\sum_{j=1}^m x_{ij\ell}^2 = m$. Denote the $q$-Schatten norm by $|||\cdot|||_q$. Using the Cauchy-Schwarz inequality and the equivalence between the $\ell_2$ (Frobenius) norm and the Schatten norm with $q = 2$, we obtain:

$$\Bigl|\sum_{i,j}\varepsilon_{ij}x_{ij}^T(\hat\beta_i-\beta_i)\Bigr| = \Bigl|\sum_{i,\ell}M_{i\ell}(\hat\beta_{i\ell}-\beta_{i\ell})\Bigr| \le \|\hat B-B\|_2\|M\|_2 = |||\hat B-B|||_2\|M\|_2 \le |||\hat B-B|||_1\|M\|_2.$$

Now, $\|M\|_2^2\sim m\sigma^2\chi^2_{np}$, hence it can be bounded by $B^2 = m\sigma^2(np + c)$ (Lemma A.1, Lounici et al. [10]) with probability at least $1-\exp\bigl(-\tfrac18\min(c, c^2/(np))\bigr)$. Denote this event by $\mathcal{A}$. Hence, we need to choose $c$ such that $c/\sqrt{np}\to\infty$. For example, we can take $c = Anp$ with $A > 1$, then $B = \sigma\sqrt{(1+A)mnp}$, and, since $\min(Anp, A^2np) = Anp$, the probability is at least $1 - e^{-Anp/8}$.
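
The chi-square tail bound used in this step can be compared with simulation. The sketch below assumes the bound has the form $P(\chi^2_D \ge D + c) \le \exp\{-\min(c, c^2/D)/8\}$, with $D$ playing the role of $np$; the constant $1/8$, the helper names and the chosen values of $D$ and $A$ are assumptions made for illustration.

```python
import numpy as np

def chi2_tail_bound(D, c):
    """Assumed upper bound exp(-min(c, c^2/D)/8) on P(chi2_D >= D + c)."""
    return np.exp(-min(c, c * c / D) / 8.0)

def chi2_tail_mc(D, c, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(chi2_D >= D + c)."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(D, size=n_draws)
    return np.mean(draws >= D + c)

D = 50                         # stands in for np
results = {
    A: (chi2_tail_mc(D, A * D), chi2_tail_bound(D, A * D))   # c = A * D, as in c = Anp
    for A in (0.5, 1.5, 3.0)
}
```

Each entry of `results` pairs the simulated tail frequency with the bound, so one can see how conservative the bound is for a given $A$.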

Denote by $V$ the subspace of $\mathbb{R}^p$ corresponding to the union of subspaces where the eigenvalues of $BB^T$ are non-zero, and by $P_V$ the projection on that space. Then $\mathbb{R}^p = V\oplus V^c$ and $\dim(V) = \mathrm{rank}(B)\le s$.

Hence, adding A 2 |||£> — to both sides, we have that on A, 

||X t (jB-jB)||1 + A 2 |||B-B|||i < A|||B|||i-A|||B|||i + (25 + A2)|||fi-B|||i 

< A|||iV^|||i - Atrace(Pv|S| + (I - Py)\B\) 
+ (2B + X 2 )\\\P V (B-B)\\\ 1 

+ (2B + X 2 )\\\(I - P V )(B - B)\\\i 

< AtracedPy^l) - A trace(Py |B|) + {2B + A 2 ) trace (\P V (B 
+ (2B + A 2 ) trace(|(J - P V )B\) - Atrace((I - P V )\B\) 

< (A + 2B + A 2 ) trace(|Py(B - 

if A ^ 25 + A 2 , since trace(|iV#|) = trace(|Py| = trace(Py Here 
\A\ = (AA 7 ) 1 / 2 . We can take, e.g. A 2 = 2B = A/2, implying that A = 
4cr- v / (1 + A)mnp. 

Hence, we have that < 2A|||Py(B-£)|||, i.e. \\\{I-P V ){B- 

B)\\\ <3A|||iV(jB-B)|||. Thus, applying RE2(s, 3, k), rank(^) < s, we have 
that 

\\X T ((3 - ml < 2A|||iV(B - 5)111! < 2\J- S \\\P V {B - B)\\\ 2 
= 2X^S\\P V {B-B)\\ 2 < ^||AT T (/3- 



hence 



K\/m 

Using this and the RE2 assumption, 



|X T (/3-/3)|| 2 ^ 2A ^ 



\B-B\Wx < 4|||iV(iB-^)|||i < i^||X T (/3-/3)|| 2 < 



Substituting the value of A, we obtain the results. 

(c) Since 7$ = U are the solution of group lasso problem with design 
matrices X{ = UXi, for £ £ ^(7): 117-flh / 0, 7^ satisfies the following 
equations; 

2Xj e (Yi - xSi) = X-, 



117. 



e\\2 
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(see also Theorem 4.2). 
Hence, 



n 

(xlt(Yi - XiPi) 



i=l 



2 _ A 2 

4 



On one hand, for £ G J(7), 



i=l 



^2[X i .iXj(p i -/3 i ) 



1/2 



.i=i 

n 

1=1 

( n 



1/2 



1/2 



1/2 



,i=l 



On event A, 

n n n 

i=i j=i i=i 

Summing over £ E J (7), we have 



I ^ = (A/4) 2 



EENft-ft)) ^^^(2-4) 



^eJ(7) i=1 
On the other hand, 



A A 



16 



(x u xj0i - A)) < En^ T (A - A)H1 = Ell^M - 



1=1 1=1 



i=l 



i=l 



^m</> max ||X T (£-£)|| 2 . 



Since rank(S) = ^(7), 

1 ia\ / W>max||X T (,8 ~ B)||l 16m^> max 4A 2 S 
rank(o) , = — = s 



(A/4) 5 



A 2 



□ 
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Proof, of Theorem 4.4. 

Using Lemma A.l, with probability at least 1 — rj, 

|Mfl-L,(ffl| < >[*^M(n + ± m l), 
r nm y m?7 i f— j' 

since n > 1. Note that if n = 1, it is sufficient to replace p by p + 1 under 
the logarithm. 

In our case, the estimators are in set B n>p . If Y!i=i fiifij = U T AU is 
the spectral decomposition, and ji = Ufa, = \ \j.kW21 7-fc are orthogonal, 
hence 

n p 

trace{^AA T } 1/2 = Ell7.fcl|2. 

i=l fe=l 
Thus, we need to bound Ya=i llftlli m terms of Ylk=i llT-felb- 



tlli 



t=l i=l 



i=l 

n 

maxM(A) V||7i|l2 

! ' ' 

1=1 

P 

maxM(/3i)V ||7^||1 

7 * ^ 



=1 

/ v \ 2 

' t\\2 



< 2maxM(A) ( ^ 111- 
SC maxM(/3j)6 2 



^=1 



since Y%=i Wl-eh < b. 

Hence, with probability at least 1 — rj, 



sup Pp (Lpifi) — L F m ) < 2 ( 1 + m ^ M(A)62 ^ / 4gF l0g( ^ } 



Note that we can use p instead of maxjM(/3j). The theorem is proved. 



□ 
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